-
Notifications
You must be signed in to change notification settings - Fork 4.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get index of first non ascii byte #39506
Conversation
458145e
to
613caec
Compare
Tagging subscribers to this area: @tannergooding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just two nits. Don't let it block you.
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
Outdated
Show resolved
Hide resolved
Updated Perf Results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are using incorrect GetNonAsciiBytes()
. For your purpose, you need GetIndexOfFirstNonAsciiByte()
.
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.Json/src/System/Text/Json/JsonEncodedText.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(undoing previous approval since many recent changes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(undoing previous approval since many recent changes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(undoing previous approval since many recent changes)
GitHub really doesn't make it easy to remove that green check mark. :( |
24eacf8
to
5d33b49
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Visibility seems wrong for something that is consumed from private methods?
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
a2ea59c
to
8f259e6
Compare
9cbe471
to
0152f82
Compare
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
Debug.Assert(Sse2.IsSupported || AdvSimd.Arm64.IsSupported, "Sse2 or AdvSimd64 required."); | ||
Debug.Assert(BitConverter.IsLittleEndian, "This SSE2/Arm64 implementation assumes little-endian."); | ||
|
||
Vector128<byte> bitmask = BitConverter.IsLittleEndian ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming the bitmask is intentionally defined as a local?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You throw PlatformNotSupportedException()
inside GetIndexOfFirstNonAsciiByteInLane()
if !BitConverter.IsLittleEndian
. Why are you creating the bitMask
for that case to begin with?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming the bitmask is intentionally defined as a local?
Yup. Declaring it static won't worry because of initializing issues in either .NET Standard or .NET Framework (I don't remember which. I just remember @carlossanlop's conclusion)
Why are you creating the bitMask for that case to begin with?
Yup, the idea is that this part will just work if we start supporting BigEndian environments
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
} | ||
|
||
FoundNonAsciiDataInCurrentMask: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for this label, it seems like duplicate computations are being made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, not really. Both the SSE and AdvSimd paths can come here, so we need to special case this label for both paths
{ | ||
goto FoundNonAsciiDataInCurrentMask; | ||
currentSseMaskOrAdvSimdIndex = (uint)Sse2.MoveMask(Sse2.LoadVector128(pBuffer)); // unaligned load |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currentSseMaskOrAdvSimdIndex [](start = 16, length = 28)
I remember adding a comment about this variable. I think you should use different variables to represent this value for SSE2
and AdvSimd
. It is little confusing and hard to maintain with future changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but I think this is a reasonable compromise. More info here where I replied to your question :) #39506 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, somehow I missed it. I have responded to your comment in the same thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, fixed now.
Do you mind sending a PR to |
Here's the local benchmark I was using dotnet/performance#1445 |
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. Just added some minor comments.
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
|
||
// calculate the index | ||
int index = BitOperations.TrailingZeroCount(mask) >> 2; | ||
Debug.Assert((mask != 0) ? index < 16 : index >= 16); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a way how to find an index of a first non-ascii byte in Vector128<byte> value
that doesn't require bitmask
.
Here is a sketch how this can be accomplished.
// value: Vector128<byte>
// Vn.B[15] : [ b15 * * * * * * * ]
// Vn.B[14] : [ b14 * * * * * * * ]
Vector128<ushort> value2 = value.AsUInt16();
// Vn.H[7] : [ b15 * * * * * * * b14 * * * * * * * ]
// Vn.H[6] : [ b13 * * * * * * * b12 * * * * * * * ]
Vector128<ushort> shiftToRightAndInsert = AdvSimd.ShiftRightAndInsert(value2, value2, 9);
// Vn.H[7] : [ b15 * * * * * * * b14 b15 * * * * * * ]
// Vn.H[6] : [ b13 * * * * * * * b12 b13 * * * * * * ]
Vector64<ushort> shiftToRightAndInsert2 = AdvSimd.ExtractNarrowingLower(shiftToRightAndInsert).AsUInt16();
// Vn.B[7] : [ b14 b15 * * * * * * ]
// Vn.B[6] : [ b12 b13 * * * * * * ]
// Vn.H[3] : [ b14 b15 * * * * * * b12 b13 * * * * * * ]
// Vn.H[2] : [ b10 b11 * * * * * * b08 b09 * * * * * * ]
Vector64<ushort> shiftToRightAndInsert3 = AdvSimd.ShiftRightAndInsert(shiftToRightAndInsert2, shiftToRightAndInsert2, 10);
// Vn.H[3] : [ b14 b15 * * * * * * b12 b13 b14 b15 * * * * ]
// Vn.H[2] : [ b10 b11 * * * * * * b08 b09 b10 b11 * * * * ]
Vector64<ushort> shiftToRightAndInsert4 = AdvSimd.ExtractNarrowingLower(shiftToRightAndInsert3.ToVector128Unsafe()).AsUInt16();
// Vn.B[3] : [ b12 b13 b14 b15 * * * * ]
// Vn.B[2] : [ b08 b09 b10 b11 * * * * ]
// Vn.H[1] : [ b12 b13 b14 b15 * * * * b08 b09 b10 b11 * * * * ]
// Vn.H[0] : [ b04 b05 b06 b07 * * * * b00 b01 b02 b03 * * * * ]
Vector64<uint> shiftToRightAndInsert5 = AdvSimd.ShiftRightAndInsert(shiftToRightAndInsert4, shiftToRightAndInsert4, 12).AsUInt32();
// Vn.H[1] : [ b12 b13 b14 b15 * * * * b08 b09 b10 b11 b12 b13 b14 b15 ]
// Vn.H[0] : [ b04 b05 b06 b07 * * * * b00 b01 b02 b03 b04 b05 b06 b07 ]
// Vn.S[0] : [ b12 b13 b14 b15 * * * * b08 b09 b10 b11 b12 b13 b14 b15 b04 b05 b06 b07 * * * * b00 b01 b02 b03 b04 b05 b06 b07 ]
Vector64<ushort> shiftToLeftAndInsert = AdvSimd.ShiftLeftAndInsert(shiftToRightAndInsert5, shiftToRightAndInsert5, 24).AsUInt16();
// Vn.S[0] : [ b00 b01 b02 b03 b04 b05 b06 b07 b08 b09 b10 b11 b12 b13 b14 b15 b04 b05 b06 b07 * * * * b00 b01 b02 b03 b04 b05 b06 b07 ]
// Vn.H[1] : [ b00 b01 b02 b03 b04 b05 b06 b07 b08 b09 b10 b11 b12 b13 b14 b15 ]
ushort index = AdvSimd.LeadingZeroCount(shiftToLeftAndInsert).GetElement(1);
Here is a link to proof-of-concept code: https://gist.github.com/echesakovMSFT/b27ed28024091472db6d3ca007f34a6d#file-program-cs
There code that JIT generates for the part computing the index as follows:
4EB01E11 mov v17.16b, v16.16b
6F174611 sri v17.8h, v16.8h, #9
0E212A30 xtn v16.8b, v17.8h
0EB01E11 mov v17.8b, v16.8b
2F164611 sri v17.4h, v16.4h, #10
0EB11E30 mov v16.8b, v17.8b
0E212A10 xtn v16.8b, v16.8h
0EB01E11 mov v17.8b, v16.8b
2F144611 sri v17.4h, v16.4h, #12
0EB11E30 mov v16.8b, v17.8b
2F385630 sli v16.2s, v17.2s, #24
2E604A10 clz v16.4h, v16.4h
0E063E00 umov w0, v16.h[1]
The five mov
s are redundant and shouldn't be there - I am working on analysing why JIT emits them (I suspect this is how we handle BuildDelayFreeUses
). Theoretically, the code should contain only 3 x sri
, 2 x xtn
, 1 x sli
, 1 x clz
and 1 x umov
instructions.
@TamarChristinaArm Do you see any issue with this approach? Aside from only supporting little-endian.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code also allows to compute mask non-ascii bytes by replacing
ushort index = AdvSimd.LeadingZeroCount(shiftToLeftAndInsert).GetElement(1);
with
ushort mask = shiftToLeftAndInsert.GetElement(1);
Although, the mask will be reversed in this case mask.15
correspond to lane[0].7
and mask.0
corresponds to lane[15].7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd actually like to get this PR in soon, considering our deadline is Friday. I'm thinking I'll file a follow-up issue to investigate this improvement and the improvements suggested in #39507. How does that sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does that sound?
@pgovind Sure, to clarify I was not suggesting to incorporate the algorithm right now. I understand that this would be an unreasonable ask.
But I can explain motivation and how I discovered this approach. I was trying to find a way to avoid doing this in your code
ulong mask = extractedBits.AsUInt64().ToScalar();
// calculate the index
int index = BitOperations.TrailingZeroCount(mask) >> 2;
since a call to BitOperations.TrailingZeroCount(mask)
is not free and use AdvSimd.LeadingZeroCount
instead.
It turns out that I ended up with a completely different approach.
We can follow up on this in .NET 6.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@echesakovMSFT In and out of itself the TrailingZeroCount
isn't that bad because the integer version of clz
is twice as fast as the simd version of clz
in your example. so the rbit+clz
on the genreg side is about as efficient as the clz
on the SIMD side from your example (looking at a Neoverse-N1).
The example you have, while it avoids having to load the mask ends up with 7 instructions that are 2 cycles each and the sri
and sli
go down the same pipe so they can't be executed in parallel.
so I think if you amortize the cost of the mask that sequence will end up being slower.
That said, you can improve this sequence further. The issue here isn't the use of a mask, it's the use of a mask that can't be created in register. Instead take a look at the sequence I pasted #39507 (comment) which also avoid the TrailingZeroCount
(on big endian) by using a different mask (which is also created without requiring a load). That is a sequence of 6 cycles and only blocks V1
once.
But looking at the overall algorithm, the majority of the cases don't require the actual index yet (with which I mean, you only check if it contains it or not...). The helper function should really end at the .ToScalar();
and ContainsNonAsciiByte_AdvSimd
should really just be != 0
.
Also what is the expected loop iterations? if you expect the majority of the characters to be ascii then the implementation is not really taking advantage of that.
currentAdvSimdIndex = (uint)GetIndexOfFirstNonAsciiByteInLane_AdvSimd(firstVector, bitmask);
secondAdvSimdIndex = (uint)GetIndexOfFirstNonAsciiByteInLane_AdvSimd(secondVector, bitmask);
if (ContainsNonAsciiByte_AdvSimd(currentAdvSimdIndex) || ContainsNonAsciiByte_AdvSimd(secondAdvSimdIndex))
You can test both vector together without needing a mask. Essentially what you want is a slightly modified version of strlen https://github.com/ARM-software/optimized-routines/blob/224cb5f67b71757b99fe1e10b5a437c17a1d733c/string/aarch64/strlen.S#L144
ldp currentAdvSimdIndex, secondAdvSimdIndex, [src, 32]!
sminp maskv.16b, currentAdvSimdIndex.16b, secondAdvSimdIndex.16b
sminp maskv.16b, maskv.16b, maskv.16b
fmov synd, maskd
tst synd, 0x8080808080808080
now when synd
test succeeds so you know you have an element there, you have to find in which vector the match is.
You can easily do this by testing the lowpart of synd
, so if synd
is x1
you test w1
using tst w1, 0x80808080
if that succeeds you know it's in the currentAdvSimdIndex
else in secondAdvSimdIndex
.
after this you implement the index finding part
sshr v1.16b, v1.16b, 7
bic v1.8h, 0x0f, lsl 8
umaxp v1.16b, v1.16b, v1.16b
fmov x1, d1
rbit x0, x0
clz x0, x1
Now you still need the >> 2
but at the point you do this you're about to add to the offset:
pBuffer += currentAdvSimdIndex;
Which should allow the JIT to lower the >> 2
into the add
as
add len, len, x0, lsr 2
This avoid you needing the mask (does need some minor tweaking for big-endian) but should be significantly smaller than what is currently being generated and should be a couple of orders faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TamarChristinaArm Thank you a lot for your detailed expanation! Going forward we should definitely capture your analysis here and in other PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np, Also FYI for maximum throughput instead of sshr v1.16b, v1.16b, 7
you can use interpret it as a signed number and instead do
cmlt v0.16b, v0.16b, #0
Which has the same latency as sshr
but isn't restricted to one pipe. That's a slightly better sequence than what I pasted in #39507 (comment) and above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and you do need an rbit
above for little endian, I'll update the comments. for this case an LD1
with two regs may be faste than an LDP
but that needs some benchmarking, it does make it easier to handle the endian differences
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np, Also FYI for maximum throughput instead of
sshr v1.16b, v1.16b, 7
you can use interpret it as a signed number and instead docmlt v0.16b, v0.16b, #0
Which has the same latency as
sshr
but isn't restricted to one pipe. That's a slightly better sequence than what I pasted in #39507 (comment) and above.
Actually, we haven't had a chance to finish #33972 in .NET 5 - so, in this case, using Vector64/128<T>.Zero
would require to zero a register and use 3-reg form of cmlt
- this is why I suggested to use sshr
at the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we haven't had a chance to finish #33972 in .NET 5 - so, in this case, using Vector64/128.Zero would require to zero a register and use 3-reg form of cmlt - this is why I suggested to use sshr at the first place.
Hmm well you can declare the use of Vector64/128<T>.Zero
to outside the loop so you pay for it only once, and assuming you don't run out of registers it should still be better as sshr
is restricted to V1
while cmlt
can go in any pipe. So in a non-micro benchmark the cmlt
sequence would likely finish earlier.
Latest perf numbers on my machine running WSL2 (I couldn't get it to run on Windows, so this is the next best thing):
These numbers high enough to leave me a little bit skeptical. The |
Thanks @pgovind . I went through your benchmark PR dotnet/performance#1445 and added some comments to eliminate any other noise. I am OK merging it at this point. Just want to double check with @echesakovMSFT on his thoughts on running benchmarks on WSL2. |
CI failures are unrelated. Merging. Thanks for the reviews everyone! |
If you don't mind, can you have open a follow-up issue to optimize the code as per Tamar's suggestions? |
Yup, filed #40805 |
Can be reviewed now. I'll add perf numbers in a bit.
Implements
GetIndexOfFirstNonAsciiByte
from #35034